Section A
A1.


Data augmentation is often used to increase the amount of data you have. Should you apply 
data augmentation to the test set? Explain why. 

[6 marks] 
A2. 


You are doing full batch gradient descent using the entire training set (not stochastic gradient 
descent). Is it necessary to shuffle the training data? Explain your answer. 

[6 marks] 
A3. 


Let p be the probability of keeping neurons in a dropout layer. We have seen that in forward 
passes, we often scale activations by dividing them by p during training time. You accidentally 
train a model with dropout layers without dividing the activations by p. How would you 
resolve this issue at test time? Please justify your answer. 


[6 marks] 
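Note: the training-time scaling described in A3 (commonly called inverted dropout) can be sketched as below. This is only an illustration under stated assumptions: the function name inverted_dropout and the list-based activations are hypothetical and not part of the paper, and plain Python is used instead of a deep learning framework.

```python
import random


def inverted_dropout(activations, p, rng):
    """Training-time forward pass: keep each unit with probability p, then divide by p.

    Dividing the surviving activations by p keeps the expected value of each
    unit equal to its original activation.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return [a / p if rng.random() < p else 0.0 for a in activations]


# Averaged over many random masks, the scaled output stays close to the input.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [inverted_dropout([2.0], p=0.8, rng=rng)[0] for _ in range(20000)]
mean_activation = sum(samples) / len(samples)
```

With this scaling the mean activation matches the unscaled input, which is why no further correction is usually needed at test time when the division by p has been applied during training.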
A4. 
Why do the layers in a deep architecture need to be non-linear? 
[8 marks] 
A5. 
Which activation function is represented by the following curve? 
[Figure: activation function curve; horizontal axis from -6 to 6, vertical axis tick at 0.5]
[4 marks] 


A6. 


The following figure shows a small convolutional neural network that converts a 13 x 13 image 
into 4 output values. The network has the following layers/operations from input to output: 
convolution with 3 filters, max pooling, ReLU, and finally a fully connected layer. For this 
network we will not be using any bias/offset parameters. How many weights/parameters in 
the convolutional layer do we need to learn? 
[Figure: 13x13 input -> Convolution (3 filters, 4x4, stride 1) -> 3@10x10 -> Max Pooling (2x2, stride 2) -> 3@5x5 -> Fully Connected -> 4 outputs]
[8 marks] 
A7. 


Suppose an input to a max pooling layer is given below. The pooling size is 2 x 2 with a stride
of 2. What would be the output of this pooling layer?


[Input matrix: only a partial row survives in the source: 112 | 100 | 25]


[6 marks] 
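Note: the pooling operation described in A7 can be sketched as below. The 4 x 4 matrix is a hypothetical example and is not the input from the question; the function name max_pool2d is also an assumption made for illustration.

```python
def max_pool2d(x, size=2, stride=2):
    """Max pooling over a 2D list of numbers, with no padding."""
    rows, cols = len(x), len(x[0])
    return [
        [
            # Take the maximum over one size x size window.
            max(x[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, cols - size + 1, stride)
        ]
        for i in range(0, rows - size + 1, stride)
    ]


# Hypothetical 4 x 4 input (not the matrix from the question).
example = [
    [1, 3, 2, 0],
    [4, 6, 5, 1],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool2d(example)  # -> [[6, 5], [7, 9]]
```

Each output entry is the largest value in one non-overlapping 2 x 2 window, so a 4 x 4 input yields a 2 x 2 output.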
A8. 


Arrange the following steps correctly to train a neural network model. 
1. Calculate error between the actual value and the predicted value. 
2. Reiterate until you find the best weights of the network. 
3. Pass an input through the network and get values from output layer. 
4. Initialise random weight and bias. 
5. Go to each neuron which contributes to the error and change its respective values to 
reduce the error. 
[10 marks] 
A9. 


Suppose you design a multilayer perceptron for classification with the following architecture. 
It has a single hidden layer with the hard threshold activation function. The output layer uses 
the SoftMax activation function with cross-entropy loss. What will go wrong if you try to train 
this network using gradient descent? Justify your answer in terms of the backpropagation 
rules. 

[6 marks] 
A10. 


Suppose you want to redesign the AlexNet architecture to reduce the number of arithmetic 
operations required for each backprop update. 


(a) Would you try to cut down on the number of weights, units, or connections? Justify your 
answer. 


(b) Would you modify the convolution layers or the fully connected layers? Justify your 
answer. 
[8 marks] 
A11.


Suppose you have a convolutional network with the following architecture: 
• The input is an RGB image of size 256 x 256.
• The first layer is a convolution layer with 32 feature maps and filters of size 3 x 3. It
uses a stride of 1, so it has the same width and height as the original image.
• The next layer is a pooling layer with a stride of 2 (so it reduces the size of each
dimension by a factor of 2) and pooling groups of size 3 x 3.
Determine the size of the receptive field for a single unit in the pooling layer (i.e., determine
the size of the region of the input image which influences the activation of that unit). You may
assume the receptive field lies entirely within the image.
[10 marks] 
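Note: receptive-field sizes for a stack of convolution and pooling layers are usually computed layer by layer, growing the field by (kernel - 1) times the product of the preceding strides. A minimal sketch follows; the function name receptive_field and the example stacks are hypothetical and deliberately differ from the network in the question.

```python
def receptive_field(layers):
    """Receptive-field size along one axis.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1    # a single input pixel sees itself
    jump = 1  # distance in input pixels between adjacent units of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks, not the one in the question.
three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])  # -> 7
strided_first = receptive_field([(3, 2), (3, 1)])        # -> 7
```

The same function can be applied to any sequence of layers by listing their kernel sizes and strides in order.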
A12. 


Which of the following decision boundaries could be a decision boundary of a neural 
network? Justify your answer. 


[Figure: candidate decision boundaries; legend: Input data]


[10 marks] 
A13. 


Which of the following Boolean functions can a neural network without a hidden layer not
represent? Justify your answer.


• AND
• OR
• NOT
• XOR
[6 marks] 
A14. 


An input image is converted into a matrix of size 28 x 28 and convolved with a kernel/filter of
size 4 x 4, using a stride of 2 and padding of 1. What will be the size of the resulting matrix?

[6 marks] 
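Note: output sizes in questions of this kind follow the usual floor formula, out = floor((n + 2*padding - k) / stride) + 1. A minimal sketch is given below as an illustration; the function name conv_output_size is hypothetical, and the check reuses the sizes already printed in the A6 figure (13 -> 10 -> 5).

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output size along one axis for a convolution or pooling window."""
    if n + 2 * padding < k:
        raise ValueError("window is larger than the padded input")
    return (n + 2 * padding - k) // stride + 1


# Cross-check against the A6 figure: 13x13 input, 4x4 conv (stride 1), 2x2 pool (stride 2).
after_conv = conv_output_size(13, 4, stride=1)          # -> 10
after_pool = conv_output_size(after_conv, 2, stride=2)  # -> 5
```

The same function applies to each spatial axis independently, so a non-square input can be handled by calling it once per axis.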
Section B 


B1. 


Any two graphs G1 and G2 are said to be isomorphic if and only if (1) their numbers of nodes
and edges are the same, and (2) their edge connectivity is retained. Which of the following graphs
might be isomorphic and which of the following are non-isomorphic? Justify your answer.
[Figure: candidate graphs]


[10 marks] 
B2. 
(a) Draw the graph whose incidence matrix is given below 
1 0 0 1 0 0 1 0 1
1 1 0 0 1 1 1 0 0
0 1 1 0 0 0 0 1 0
0 0 1 1 1 1 0 0 0
0 0 0 0 0 0 0 1 1
[6 marks] 
(b) Comment on whether the following is a valid incidence matrix of a graph. Why or why not?
1 0 0 1 0 0
1100141 
0 1 1 0 0 0
0 0 1 1 1 1
0 1 0 0 0 1
[6 marks] 


B3. 


Define permutation invariant and permutation equivariant functions with examples. 
[6 marks] 


B4. 


Define the N-way K-shot setting in few-shot learning. Explain with an example.
[6 marks] 


B5. 


Consider the 3D convolutional neural network defined by the layers in the left column in the 
following table. Fill in the shape of the output volume and the number of parameters at each 
layer. You can write the activation shapes in the format (H, W, T, C), where H, W, T, C are the 
height, width, temporal and channel dimensions, respectively. Unless specified, assume 
padding 1, stride 1 where appropriate. 


Notation: 
• CONV-K, N denotes a convolutional layer with N filters, each of height and width equal to K.
• POOL-K denotes a K x K x K max-pooling layer with a stride of K and 0 padding.
• FLATTEN flattens its inputs, identical to torch.nn.Flatten.
• FC-N denotes a fully connected layer with N neurons.


Note: each row in the following table is worth 4 marks.


Layer         Dimension of Feature Map    Number of Parameters
Input         56x56x16x3                  0
CONV-3, 16
ReLU
POOL-2
BATCHNORM
CONV-3, 8
ReLU
POOL-2
FLATTEN
FC-10


[36 marks] 
B6. 


(a) What type of information is used to bridge the visual and semantic information in the
zero-shot image classification problem?


[6 marks] 
(b) How does that information help to solve zero-shot learning?

[6 marks] 
B7. 
(a) What is generalised zero-shot learning? 

[6 marks] 
(b) Why is it harder than zero-shot learning? 

[6 marks] 


B8. 
Mention two different ways to build a video classification model (Hint: consider which type
of architectures you could use for video classification).
[6 marks] 


Section C 
C1. 


Separable convolutions significantly reduce the compute in Convolutional Neural Networks
(CNNs). One popular design is MobileNet, which uses depthwise and pointwise
convolutional operations. Moreover, modern neural networks use separable
convolutions extensively. Suppose we have an input feature map of size HxWxC, where H,
W and C are the height, width and number of channels. If H = W = 56 and C = 512, then
I) How many parameters are there for a normal convolutional layer with filter size 3x3
and output channels 512 (ignore bias)? [3 marks]
II) How many parameters are there for separable convolution, i.e., depthwise then
pointwise convolution with filter size 3x3 and output channels 512? How much
parameter saving does separable convolution bring? [3 marks]
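Note: the two counting rules compared in C1 (bias ignored) can be sketched as below. The function names and the small channel counts are hypothetical and chosen for illustration only; they are not the values from the question.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k convolution: one k x k x c_in filter per output channel."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise (k x k per input channel) plus pointwise (1 x 1) convolution."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical sizes (not the ones in the question): 3x3 filters, 32 -> 64 channels.
standard = standard_conv_params(3, 32, 64)    # -> 18432
separable = separable_conv_params(3, 32, 64)  # -> 2336
saving_ratio = standard / separable
```

Note that neither count depends on the spatial size H x W of the feature map; only the compute does.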


C2. 


The figure below shows multi-head attention using the well-known scaled dot-product attention.


Let $X \in \mathbb{R}^{N \times D}$ be the input feature map, where N is the number of tokens and D is the
dimensionality of the tokens. For simplicity, assume N = 197 and D = 768.

I) What is the role of the linear layer depicted after Q, K and V (describe in no more than 50
words)? What are the mathematical representations to obtain Q, K and V? How can
you implement it using nn.Conv2d for one head?

[6 marks] 

II) Why is there a need for multiple heads (describe in no more than 50 words)?

[3 marks] 

III) What is the role of the linear layer at the end (describe in no more than 50 words)? How
can you implement it using nn.Conv2d?

[6 marks] 


[Figure: Multi-Head Attention with Scaled Dot-Product Attention]


C3.


The figures below are for the Super Token ViT Transformer.
I) Explain the role of super tokens (in no more than 50 words).
[5 marks]

II) Explain the intuition behind the STM block, its inner workings, and why depthwise and
pointwise convolutions are chosen rather than multi-head self-attention.
What could be an alternative to depthwise convolution? (in no more than 150 words).
[10 marks] 


[Figure: (a) Overall Architecture; (b) Super Token Transformer Blocks. Labels: Divide into Windows; Add Pos. Encoding; Detach data tokens and send for next WMSA; Attach all Super tokens and send to STM]


Super Tokens ViT (Global Interaction Modelling in Vision Transformer via Super Tokens) 


C4. 

Below are two typical images of cross-sections of lung cancer tissue. For pathologists and cancer
specialists, it is very important to understand these types of images in order to predict
survival, choose the type of treatment, etc. For a machine to assist pathologists and medical
specialists, it needs to know the cancer type, the extent of cancer spread, and the patient's
predicted survival, among many other things. You need to recommend a deep learning
solution for the problem. What kind of transformer neural network should you use? Discuss
your choices. What self-supervised learning algorithm would you choose to pretrain your
transformer? Explain the learning principles of the approach you would choose. Discuss the
merits of the choice you have made. Explain why other self-supervised methods (as examples,
you can mention just two of the well-known methods) may not be the best choice. (Explain
your answers in no more than 200 words.)
C5.
Below is an image which may be a camera view from an autonomous car, showing multiple
everyday objects on the street. It is critical for the autonomous cars of the future to understand
each object and have a segmentation mask for that object. You need to recommend a deep
learning solution for the problem. What kind of transformer neural network should you use?
Discuss your choices. What self-supervised learning algorithm would you choose to pretrain
your transformer? Explain the learning principles of the approach you would choose. Discuss the
merits of the choice you have made. Explain why other self-supervised methods (as examples,
you can mention just two of the well-known methods) may not be the best choice. (Explain
your answers in no more than 300 words.)

[25 marks] 


C6. 
Below is a figure of the SiT self-supervised vision Transformer.
I) Why is there a need for positional embedding? (no more than 50 words).
[2 marks]
II) What is the role of the reconstruction loss? What benefit does it bring? Is it enough for
self-supervised pretraining? (no more than 100 words).
[3 marks] 
III) What is the role of contrastive head and contrastive loss? Is it needed for self- 
supervised learning? (no more than 100 words). 
[3 marks] 
IV) What are the advantages of SiT over other self-supervised methods (at least 3 to 4 
advantages). (no more than 200 words). 
[6 marks] 
Fig. 1: Self-supervised vIsion Transformer (SiT)